Stratified Sampling Meets Machine Learning
نویسندگان
چکیده
This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which evaluating aggregate queries can be done both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a cost zero solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling which are de-facto the industry standards.
منابع مشابه
Support Vector Machine based on Stratified Sampling
Support vector machine is a classification algorithm based on statistical learning theory. It has shown many results with good performances in the data mining fields. But there are some problems in the algorithm. One of the problems is its heavy computing cost. So we have been difficult to use the support vector machine in the dynamic and online systems. To overcome this problem we propose to u...
متن کاملAccelerating Minibatch Stochastic Gradient Descent using Stratified Sampling
Stochastic Gradient Descent (SGD) is a popular optimization method which has been applied to many important machine learning tasks such as Support Vector Machines and Deep Neural Networks. In order to parallelize SGD, minibatch training is often employed. The standard approach is to uniformly sample a minibatch at each step, which often leads to high variance. In this paper we propose a stratif...
متن کاملCalibration for Stratified Classification Models
In classification problems, sampling bias between training data and testing data is critical to the ranking performance of classification scores. Such bias can be both unintentionally introduced by data collection and intentionally introduced by the algorithm, such as under-sampling or weighting techniques applied to imbalanced data. When such sampling bias exists, using the raw classification ...
متن کاملConvergence Optimization of Backpropagation Artificial Neural Network Used for Dichotomous Classification of Intrusion Detection Dataset
There are distinguished two categories of intrusion detection approaches utilizing machine learning according to type of input data. The first one represents network intrusion detection techniques which consider only data captured in network traffic. The second one represents general intrusion detection techniques which intake all possible data sources including host-based features as well as n...
متن کاملGrover Title of Thesis : Active Learning and its Application to Heteroscedastic Problems
This thesis presents Active Learning algorithms for heterogeneous distributions. Active Learning is a vast and growing sub-field of Machine Learning, where many significant contributions have been made. This thesis makes two contribution to the Active Learning field. First contribution is a broad survey of Active Learning literature and second contribution is two new Active Learning algorithms ...
متن کامل